Name: Huimiao Chen, JHED ID: hchen185, Email: hchen185@jhu.edu.
(1) Look at the chapter on interactive graphics and, specifically, the code to display a subject's MRICloud data as a sunburst plot. Do the following. Display this subject's data as a Sankey diagram. Display as many levels as you can (at least 3) for Type = 1, starting from the intracranial volume. Put this in a file called hw4.ipynb.
import pandas as pd
import numpy as np
import plotly.graph_objects as go
import random
# load data and pre-process
## data urls
url_1 = "https://raw.githubusercontent.com/smart-stats/ds4bio_book/main/book/assetts/kirby21AllLevels.csv"
url_2 = "https://raw.githubusercontent.com/bcaffo/MRIcloudT1volumetrics/master/inst/extdata/multilevel_lookup_table.txt"
## load in the hierarchy information
multilevel_lookup = pd.read_csv(url_2, sep = "\t").drop(['Level5'], axis = 1)
multilevel_lookup = multilevel_lookup.rename(columns = {
"modify" : "roi",
"modify.1" : "level4",
"modify.2" : "level3",
"modify.3" : "level2",
"modify.4" : "level1"})
multilevel_lookup = multilevel_lookup[['roi', 'level4', 'level3', 'level2', 'level1']]
## load in the subject data
id = 127
subjectData = pd.read_csv(url_1)
subjectData = subjectData.loc[(subjectData.type == 1) & (subjectData.level == 5) & (subjectData.id == id)]
subjectData = subjectData[['roi', 'volume']]
## merge the subject data with the multilevel data
subjectData = pd.merge(subjectData, multilevel_lookup, on = "roi")
subjectData = subjectData.assign(icv = "ICV")
subjectData = subjectData.assign(comp = subjectData.volume / np.sum(subjectData.volume))
## print data
print(subjectData)
# prepare data for Sankey diagram
## initialize an empty dictionary
data_dict = {}
## prepare node names and colors
data_dict["node"] = {"label": [], "color": []}
### prepare node names
cols = ["icv", "level1", "level2", "level3", "level4", "roi"]
node_matrix = subjectData.loc[:, cols].values
for i in range(len(node_matrix)): # add level names as prefixes to avoid same names from different levels
for j, level in enumerate(cols):
node_matrix[i][j] = level + ":" + node_matrix[i][j] # the prefixes will be deleted later
node_names = node_matrix.flatten(order='F').tolist()
data_dict["node"]["label"] = list(set(node_names))
### prepare node colors
opacity_node = 0.8
colors = [(random.randint(0, 255), random.randint(0, 255), random.randint(0, 255), opacity_node)
for i in range(len(data_dict["node"]["label"]))] # generate RGBA color tuples with opacity
color_strings = ['rgba({}, {}, {}, {})'.format(r, g, b, a)
for r, g, b, a in colors] # convert the RGBA color tuples to a list of RGBA color strings
data_dict["node"]["color"] = color_strings
## prepare link sources, targets, values, and colors
data_dict["link"] = {"source": [], "target": [], "value": [], "color": []}
label_No = {} # initialize node numbers dict
for index_, label_ in enumerate(data_dict["node"]["label"]):
label_No[label_] = index_ # assign numbers to nodes
cols = ["icv", "level1", "level2", "level3", "level4", "roi", "comp"]
subjectData_new = subjectData.loc[:, cols]
for i in subjectData_new.index:
for j in subjectData_new.columns[1:-1]: # exclusive of "icv" and "comp"
source_name = cols[cols.index(j) - 1] + ":" + subjectData_new.loc[i, cols[cols.index(j) - 1]]
source_number = label_No[source_name]
target_name = j + ":" + subjectData_new.loc[i, j]
target_number = label_No[target_name]
value = subjectData_new.loc[:, cols].groupby(j).sum().loc[subjectData_new.loc[i, j],"comp"]
if_break = False # begin check whether the data has been stored
for a, b, c in zip(data_dict["link"]["source"], data_dict["link"]["target"], data_dict["link"]["value"]):
if a == source_number and b == target_number and c == value:
if_break = True
if if_break:
pass
else:
data_dict["link"]["source"].append(source_number)
data_dict["link"]["target"].append(target_number)
data_dict["link"]["value"].append(value)
opacity_link = 0.4
data_dict["link"]["color"] = [data_dict["node"]["color"][src].replace(str(opacity_node), str(opacity_link))
for src in data_dict["link"]["source"]]
for index_, label_ in enumerate(data_dict["node"]["label"]): # the prefixes of node names are deleted
data_dict["node"]["label"][index_] = label_.split(":")[1]
# plot Sankey diagram
fig = go.Figure(data=[go.Sankey(
# valueformat = ".0000f",
valuesuffix = "(comp)",
# Define nodes
node = dict(
pad = 15,
thickness = 15,
line = dict(color = "black", width = 0.5),
label = data_dict['node']['label'],
color = data_dict['node']['color']
),
# Add links
link = dict(
source = data_dict['link']['source'],
target = data_dict['link']['target'],
value = data_dict['link']['value'],
color = data_dict['link']['color']
))])
fig.update_layout(title_text="Structure of the intracranial volume<br>Source: Multi-level MRICloud data: <a href='https://raw.githubusercontent.com/smart-stats/ds4bio_book/main/book/assetts/kirby21AllLevels.csv'>kirby21AllLevels</a>",
font_size=10)
fig.show()
# Note: The ipynb file will not display the interactive graphics unless you rerun the code.
# The interactive graphics can be accessed via the urls in Q2.
roi volume level4 level3 \
0 SFG_L 12926 SFG_L Frontal_L
1 SFG_R 10050 SFG_R Frontal_R
2 SFG_PFC_L 12783 SFG_L Frontal_L
3 SFG_PFC_R 11507 SFG_R Frontal_R
4 SFG_pole_L 3078 SFG_L Frontal_L
.. ... ... ... ...
275 Chroid_LVetc_L 444 AnteriorLateralVentricle_L LateralVentricle_L
276 Chroid_LVetc_R 371 AnteriorLateralVentricle_R LateralVentricle_R
277 IV_ventricle 2700 IV_ventricle IV_ventricle
278 ECCL_L 292 inf_DPWM_L InferiorWM_L
279 ECCL_R 292 inf_DPWM_R InferiorWM_R
level2 level1 icv comp
0 CerebralCortex_L Telencephalon_L ICV 0.009350
1 CerebralCortex_R Telencephalon_R ICV 0.007270
2 CerebralCortex_L Telencephalon_L ICV 0.009247
3 CerebralCortex_R Telencephalon_R ICV 0.008324
4 CerebralCortex_L Telencephalon_L ICV 0.002227
.. ... ... ... ...
275 Ventricle CSF ICV 0.000321
276 Ventricle CSF ICV 0.000268
277 Ventricle CSF ICV 0.001953
278 WhiteMatter_L Telencephalon_L ICV 0.000211
279 WhiteMatter_R Telencephalon_R ICV 0.000211
[280 rows x 8 columns]
(2) Create a simple webpage containing this graphic and host it on github pages. -Do not- host this off of your assignment repo from github classroom, since this is not public. Instead, you'll have to create a new public repo from your regular github account and add this file. Put the link to your live web page in a markdown cell of your hw4.ipynb file. Note, an easy way to create a webpage with this graphic is to export an ipynb as an html file.
# method 1: generate an html file using the folloing code
# method 2: directly save the .ipynb file as a html file,
# which will keep the figure drawn in Q1 interactive.
fig.write_html(file="HW4_SankeyDiagram.html")
The html files generated by the two methods mentioned in the last chunk can be accessed in the following two urls from my public github repo:
(3) Create the opioid sqlite database from https://smart-stats.github.io/ds4bio_book/book/_build/html/sqlite.html. However, only go to the step where the csv files are read into the database. Then exit sqlite and you should have a file opioid.db that has the data. Next, read the three tables into pandas dataframes and do the data wrangling from the sqlite chapter directly in pandas. Add the python code to your hw4.ipynb file.
After downloading the three csv files into the working directory, I run the following commands in the prompt/command line to obtain the required opioid.db file, which is the input data of the code in the next chunk.
jupyter-hchen185@ip-172-31-29-195:~/HW4$ sqlite3 opioid.db
SQLite version 3.36.0 2021-06-18 18:36:39
Enter ".help" for usage hints.
sqlite> .mode csv
sqlite> .import county_pop_arcos.csv population
sqlite> .import county_annual.csv annual
sqlite> .import land_area.csv land
sqlite> .quit
import sqlite3 as sq3
import pandas as pd
# read data into pandas
## creat sql connection
con = sq3.connect("opioid.db")
# read the three tables into pandas dataframes
population = pd.read_sql_query("SELECT * from population", con)
annual = pd.read_sql_query("SELECT * from annual", con)
land = pd.read_sql_query("SELECT * from land", con)
## close sql connection
con.close
# print data heads
print("---*-data preview starts-*---")
print("-*-The following is the head of the population table-*-")
print(population.head())
print("---*---*---")
print("-*-The following is the head of the annual table-*-")
print(annual.head())
print("---*---*---")
print("-*-The following is the head of the land table-*-")
print(land.head())
print("---*-data preview ends-*---")
# data wrangling
print("---*-data wrangling starts-*---")
## print out a few columns of the population table
data_select = population.loc[:, ["BUYER_COUNTY", "BUYER_STATE", "STATE", "COUNTY", "year", "population"]]
print("-*-The following is the head of a few columns in the population table-*-")
print(data_select.head(5))
print("---*---*---")
## missing data in the annual table
data_miss = annual.loc[annual.countyfips == "NA"]
print("-*-The following is the head of missing data in the annual table-*-")
print(data_miss.head(10))
print("---*---*---")
## missing data in the annual table other than Puerto Rico (PR)
data_miss_noPR = annual.loc[(annual.countyfips == "NA") & (annual.BUYER_STATE != "PR")]
print("-*-The following is the head of missing data in the annual table other than Puerto Rico (PR)-*-")
print(data_miss_noPR.head(10))
print("---*---*---")
## set the Montgometry countyfips to the correct value 05097 and ignore other missing values
print("-*-The following is the data of Montgomery county AR in the annual table-*-")
print(annual[(annual.BUYER_STATE == "AR") & (annual.BUYER_COUNTY == "MONTGOMERY")])
annual.loc[(annual.BUYER_STATE == "AR") & (annual.BUYER_COUNTY == "MONTGOMERY"), "countyfips"] = 5097 # update "annual"
print("-*-The following is the data of Montgomery county AR in the revised annual table-*-")
print(annual[(annual.BUYER_STATE == "AR") & (annual.BUYER_COUNTY == "MONTGOMERY")])
print("---*---*---")
## delete rows from the annual table that have missing county data
print("-*-The following are the rows with BUYER_COUNTY==NA in the annual table-*-")
print(annual[annual.BUYER_COUNTY=="NA"])
annual.drop(annual[annual.BUYER_COUNTY=="NA"].index, inplace=True)
print("-*-The following are the rows with BUYER_COUNTY==NA in the revised annual table-*-")
print(annual[annual.BUYER_COUNTY=="NA"]) # expected to be empty
print("---*---*---")
## grab three columns from the land table and rename the column STCOU to coutyfips
land_area = land.loc[:, ["Areaname","STCOU","LND110210D"]]
land_area.rename(columns={"STCOU":"countyfips"}, inplace=True)
print("-*-The following are the head of the land_area table-*-")
print(land_area.head())
print("---*---*---")
## join the population and land_area tables to create a new table county_info
county_info = population.merge(land_area, how="left", on="countyfips")
print("-*-The following are the head of the county_info table-*-")
print(county_info.head())
print("---*---*---")
## print out the counts to make sure we accounted correctly
print("the count of the land table is:", land.shape[0])
print("the count of the land_area table is:", land_area.shape[0])
print("the count of the county_info table is:", county_info.shape[0])
print("the count of the population table is:", population.shape[0])
---*-data preview starts-*---
-*-The following is the head of the population table-*-
BUYER_COUNTY BUYER_STATE countyfips STATE COUNTY county_name \
0 1 AUTAUGA AL 01001 1 1 Autauga
1 2 BALDWIN AL 01003 1 3 Baldwin
2 3 BARBOUR AL 01005 1 5 Barbour
3 4 BIBB AL 01007 1 7 Bibb
4 5 BLOUNT AL 01009 1 9 Blount
NAME variable year population
0 Autauga County, Alabama B01003_001 2006 51328
1 Baldwin County, Alabama B01003_001 2006 168121
2 Barbour County, Alabama B01003_001 2006 27861
3 Bibb County, Alabama B01003_001 2006 22099
4 Blount County, Alabama B01003_001 2006 55485
---*---*---
-*-The following is the head of the annual table-*-
BUYER_COUNTY BUYER_STATE year count DOSAGE_UNIT countyfips
0 1 ABBEVILLE SC 2006 877 363620 45001
1 2 ABBEVILLE SC 2007 908 402940 45001
2 3 ABBEVILLE SC 2008 871 424590 45001
3 4 ABBEVILLE SC 2009 930 467230 45001
4 5 ABBEVILLE SC 2010 1197 539280 45001
---*---*---
-*-The following is the head of the land table-*-
Areaname STCOU LND010190F LND010190D LND010190N1 LND010190N2 \
0 1 UNITED STATES 00000 0 3787425.08 0000 0000
1 2 ALABAMA 01000 0 52422.94 0000 0000
2 3 Autauga, AL 01001 0 604.49 0000 0000
3 4 Baldwin, AL 01003 0 2027.08 0000 0000
4 5 Barbour, AL 01005 0 904.59 0000 0000
LND010200F LND010200D LND010200N1 ... LND110210N1 LND110210N2 LND210190F \
0 0 3794083.06 0000 ... 0000 0000 0
1 0 52419.02 0000 ... 0000 0000 0
2 0 604.45 0000 ... 0000 0000 0
3 0 2026.93 0000 ... 0000 0000 0
4 0 904.52 0000 ... 0000 0000 0
LND210190D LND210190N1 LND210190N2 LND210200F LND210200D LND210200N1 \
0 251083.35 0000 0000 0 256644.62 0000
1 1672.71 0000 0000 0 1675.01 0000
2 8.48 0000 0000 0 8.48 0000
3 430.55 0000 0000 0 430.58 0000
4 19.59 0000 0000 0 19.61 0000
LND210200N2
0 0000
1 0000
2 0000
3 0000
4 0000
[5 rows x 35 columns]
---*-data preview ends-*---
---*-data wrangling starts-*---
-*-The following is the head of a few columns in the population table-*-
BUYER_COUNTY BUYER_STATE STATE COUNTY year population
0 AUTAUGA AL 1 1 2006 51328
1 BALDWIN AL 1 3 2006 168121
2 BARBOUR AL 1 5 2006 27861
3 BIBB AL 1 7 2006 22099
4 BLOUNT AL 1 9 2006 55485
---*---*---
-*-The following is the head of missing data in the annual table-*-
BUYER_COUNTY BUYER_STATE year count DOSAGE_UNIT countyfips
187 188 ADJUNTAS PR 2006 147 102800 NA
188 189 ADJUNTAS PR 2007 153 104800 NA
189 190 ADJUNTAS PR 2008 153 45400 NA
190 191 ADJUNTAS PR 2009 184 54200 NA
191 192 ADJUNTAS PR 2010 190 56200 NA
192 193 ADJUNTAS PR 2011 186 65530 NA
193 194 ADJUNTAS PR 2012 138 57330 NA
194 195 ADJUNTAS PR 2013 138 65820 NA
195 196 ADJUNTAS PR 2014 90 59490 NA
196 197 AGUADA PR 2006 160 49200 NA
---*---*---
-*-The following is the head of missing data in the annual table other than Puerto Rico (PR)-*-
BUYER_COUNTY BUYER_STATE year count DOSAGE_UNIT countyfips
10071 10072 GUAM GU 2006 319 265348 NA
10072 10073 GUAM GU 2007 330 275600 NA
10073 10074 GUAM GU 2008 313 286900 NA
10074 10075 GUAM GU 2009 390 355300 NA
10075 10076 GUAM GU 2010 510 413800 NA
10076 10077 GUAM GU 2011 559 475600 NA
10077 10078 GUAM GU 2012 616 564800 NA
10078 10079 GUAM GU 2013 728 623200 NA
10079 10080 GUAM GU 2014 712 558960 NA
17429 17430 MONTGOMERY AR 2006 469 175390 NA
---*---*---
-*-The following is the data of Montgomery county AR in the annual table-*-
BUYER_COUNTY BUYER_STATE year count DOSAGE_UNIT countyfips
17429 17430 MONTGOMERY AR 2006 469 175390 NA
17430 17431 MONTGOMERY AR 2007 597 241270 NA
17431 17432 MONTGOMERY AR 2008 561 251760 NA
17432 17433 MONTGOMERY AR 2009 554 244160 NA
17433 17434 MONTGOMERY AR 2010 449 247990 NA
17434 17435 MONTGOMERY AR 2011 560 313800 NA
17435 17436 MONTGOMERY AR 2012 696 339520 NA
17436 17437 MONTGOMERY AR 2013 703 382300 NA
17437 17438 MONTGOMERY AR 2014 491 396900 NA
-*-The following is the data of Montgomery county AR in the revised annual table-*-
BUYER_COUNTY BUYER_STATE year count DOSAGE_UNIT countyfips
17429 17430 MONTGOMERY AR 2006 469 175390 5097
17430 17431 MONTGOMERY AR 2007 597 241270 5097
17431 17432 MONTGOMERY AR 2008 561 251760 5097
17432 17433 MONTGOMERY AR 2009 554 244160 5097
17433 17434 MONTGOMERY AR 2010 449 247990 5097
17434 17435 MONTGOMERY AR 2011 560 313800 5097
17435 17436 MONTGOMERY AR 2012 696 339520 5097
17436 17437 MONTGOMERY AR 2013 703 382300 5097
17437 17438 MONTGOMERY AR 2014 491 396900 5097
---*---*---
-*-The following are the rows with BUYER_COUNTY==NA in the annual table-*-
BUYER_COUNTY BUYER_STATE year count DOSAGE_UNIT countyfips
27741 27742 NA AE 2006 2 330 NA
27742 27743 NA CA 2006 47 12600 NA
27743 27744 NA CT 2006 305 78700 NA
27744 27745 NA CT 2007 112 30900 NA
27745 27746 NA CT 2008 48 15000 NA
27746 27747 NA FL 2006 9 900 NA
27747 27748 NA FL 2007 7 700 NA
27748 27749 NA GA 2006 114 51700 NA
27749 27750 NA IA 2006 7 2300 NA
27750 27751 NA IN 2006 292 39300 NA
27751 27752 NA MA 2006 247 114900 NA
27752 27753 NA NV 2006 380 173600 NA
27753 27754 NA NV 2007 447 200600 NA
27754 27755 NA NV 2008 5 2200 NA
27755 27756 NA OH 2006 23 5100 NA
27756 27757 NA PR 2006 10 17800 NA
27757 27758 NA PR 2007 2 1300 NA
-*-The following are the rows with BUYER_COUNTY==NA in the revised annual table-*-
Empty DataFrame
Columns: [, BUYER_COUNTY, BUYER_STATE, year, count, DOSAGE_UNIT, countyfips]
Index: []
---*---*---
-*-The following are the head of the land_area table-*-
Areaname countyfips LND110210D
0 UNITED STATES 00000 3531905.43
1 ALABAMA 01000 50645.33
2 Autauga, AL 01001 594.44
3 Baldwin, AL 01003 1589.78
4 Barbour, AL 01005 884.88
---*---*---
-*-The following are the head of the county_info table-*-
BUYER_COUNTY BUYER_STATE countyfips STATE COUNTY county_name \
0 1 AUTAUGA AL 01001 1 1 Autauga
1 2 BALDWIN AL 01003 1 3 Baldwin
2 3 BARBOUR AL 01005 1 5 Barbour
3 4 BIBB AL 01007 1 7 Bibb
4 5 BLOUNT AL 01009 1 9 Blount
NAME variable year population Areaname \
0 Autauga County, Alabama B01003_001 2006 51328 Autauga, AL
1 Baldwin County, Alabama B01003_001 2006 168121 Baldwin, AL
2 Barbour County, Alabama B01003_001 2006 27861 Barbour, AL
3 Bibb County, Alabama B01003_001 2006 22099 Bibb, AL
4 Blount County, Alabama B01003_001 2006 55485 Blount, AL
LND110210D
0 594.44
1 1589.78
2 884.88
3 622.58
4 644.78
---*---*---
the count of the land table is: 3198
the count of the land_area table is: 3198
the count of the county_info table is: 28265
the count of the population table is: 28265
(4) Create an interactive scatter plot of average number of opiod pills by year plot using plotly. See the example here. Don't do the intervals (little vertical lines), only the points. Add your plot to an html file with your repo for your Sanky diagram and host it publicly. Put a link to your hosted file in a markdown cell of your hw4.ipynb file. Note, an easy way to create a webpage with this graphic is to export an ipynb as an html file.
import plotly.express as px
# convert data into float point numbers
annual["DOSAGE_UNIT"] = annual["DOSAGE_UNIT"].astype(float) # some original data in this column is not float
# calculate the average numbers
data_scatter = annual.groupby('year').mean()
# plot the interactive scatter plot of average number of opiod pills by year plot
fig = px.scatter(data_scatter)
fig.update_layout(title_text="Interactive scatter plot of average number of opiod pills by year", font_size=12)
fig.show()
# wirte into the html file
fig.write_html(file="HW4_ScatterPlot.html")
The html files generated for the interactive scatter plot can be accessed in the following two urls from my public github repo (one is generated by wirte_html and the other is generated by exporting the ipynb as an html file):